Welcome to the Apps project! To give you a taste of your future career, we're going to walk through exactly the kind of notebook that you'd write as a data scientist. In the process, we'll be sure to signpost the general framework for our investigation - the Data Science Pipeline - as well as give reasons for why we're doing what we're doing. We're also going to apply some of the skills and knowledge you've built up in the previous unit when reading Professor Spiegelhalter's The Art of Statistics (hereinafter AoS).
So let's get cracking!
Brief
Did Apple Store apps receive better reviews than Google Play apps?
Along the way, we'll add a `platform` column to both the Apple and the Google dataframes, eliminate `NaN` values, and summarize the ratings grouped by `platform`.
In this case we are going to import pandas, numpy, scipy, random, and matplotlib.pyplot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# scipy is a library for statistical tests and visualizations
from scipy import stats
# random enables us to generate random numbers
import random
Let's download the data from Kaggle. Kaggle is a fantastic resource: a kind of social network for data scientists, it boasts projects, datasets and news on the freshest libraries and technologies all in one place. The data from the Apple Store can be found here and the data from the Google Play Store can be found here. Download the datasets and save them in your working directory.
# Now that the files are saved, we want to load them into Python using read_csv and pandas.
# Create a variable called google, and store in it the path of the csv file that contains your google dataset.
# If your dataset is in the same folder as this notebook, the path will simply be the name of the file.
google = 'googleplaystore.csv'
# Read the csv file into a data frame called Google using the read_csv() pandas method.
Google = pd.read_csv(google)
# Using the head() pandas method, observe the first three entries.
Google.head(3)
# Create a variable called apple, and store in it the path of the csv file that contains your apple dataset.
apple = 'AppleStore.csv'
# Read the csv file into a pandas DataFrame object called Apple.
Apple = pd.read_csv(apple)
# Observe the first three entries like you did with your other data.
Apple.head(3)
From the documentation of these datasets, we can infer that the most appropriate columns to answer the brief are:
For the Google dataset:
- `Category` # Do we need this?
- `Rating`
- `Reviews`
- `Price` (maybe)

For the Apple dataset:
- `prime_genre` # Do we need this?
- `user_rating`
- `rating_count_tot`
- `price` (maybe)

Let's select only those columns that we want to work with from both datasets. We'll overwrite the original variables with these subsets.
# Subset our DataFrame object Google by selecting just the variables ['Category', 'Rating', 'Reviews', 'Price']
Google = Google[['Category', 'Rating', 'Reviews', 'Price']]
# Check the first three entries
Google.head(3)
# Do the same with our Apple object, selecting just the variables ['prime_genre', 'user_rating', 'rating_count_tot', 'price']
Apple = Apple[['prime_genre', 'user_rating', 'rating_count_tot', 'price']]
# Let's check the first three entries
Apple.head(3)
Types are crucial for data science in Python. Let's determine whether the variables we selected in the previous section have the types they should, or whether there are any errors here.
# Using the dtypes feature of pandas DataFrame objects, check out the data types within our Apple dataframe.
# Are they what you expect?
Apple.dtypes
This is looking healthy. But what about our Google data frame?
# Using the same dtypes feature, check out the data types of our Google dataframe.
Google.dtypes
Weird. The data type for the column 'Price' is 'object', not a numeric data type like a float or an integer. Let's investigate the unique values of this column.
# Use the unique() pandas method on the Price column to check its unique values.
Google.Price.unique()
Aha! Fascinating. There are actually two issues here:
- Some rows have the value 'Everyone' in the 'Price' column. That is a massive mistake!
- The prices contain dollar symbols, so pandas is storing them as strings rather than numbers.

Let's address the first issue first: let's check the data points that have the price value 'Everyone'.
# Let's check which data points have the value 'Everyone' for the 'Price' column by subsetting our Google dataframe.
# Subset the Google dataframe on the price column.
# To be sure: you want to pick out just those rows whose value for the 'Price' column is just 'Everyone'.
Google[Google.Price == 'Everyone']
Thankfully, it's just one row. We've gotta get rid of it.
# Let's eliminate that row.
# Subset our Google dataframe to pick out just those rows whose value for the 'Price' column is NOT 'Everyone'.
# Reassign that subset to the Google variable.
# You can do this in two lines or one. Your choice!
Google = Google[Google.Price != 'Everyone']
# Check again the unique values of Google
Google.Price.unique()
Our second problem remains: I'm seeing dollar symbols when I close my eyes! (And not in a good way).
This is a problem because Python actually considers these values strings. So we can't do mathematical and statistical operations on them until we've made them into numbers.
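To see why this matters, here's a tiny toy example (not from our dataset) of what happens when you do arithmetic on a string price:
price = '$4.99'
# Multiplying a string just repeats it — not what we want
print(price * 2)  # '$4.99$4.99'
# After stripping the symbol and converting, arithmetic works as expected
print(float(price.replace('$', '')) * 2)  # 9.98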
# Let's create a variable called nosymb.
# This variable will take the Price column of Google and apply the str.replace() method.
nosymb = Google.Price.str.replace('$', '', regex=False)
# Now we need to do two things:
# i. Make the values in the nosymb variable numeric using the to_numeric() pandas method.
# ii. Assign this new set of numeric, dollar-sign-less values to Google['Price'].
# You can do this in one line if you wish.
Google.Price = pd.to_numeric(nosymb)
nosymb.unique()
Now let's check the data types for our Google dataframe again, to verify that the 'Price' column really is numeric now.
# Use the function dtypes.
Google.dtypes
Notice that the `Reviews` column is still an object column. We actually need this column to be a numeric column, too.
# Convert the 'Reviews' column to a numeric data type.
Google['Reviews'] = pd.to_numeric(Google.Reviews)
# Let's check the data types of Google again
Google.dtypes
Add a `platform` column to both the Apple and the Google dataframes

Let's add a new column to both dataframe objects called `platform`: all of its values in the Google dataframe will be just 'Google', and all of its values in the Apple dataframe will be just 'Apple'.
The reason we're making this column is so that we can ultimately join our Apple and Google data together, and actually test out some hypotheses to solve the problem in our brief.
# Create a column called 'platform' in both the Apple and Google dataframes.
# Add the value 'apple' and the value 'google' as appropriate.
Apple['platform'] = 'Apple'
Google['platform'] = 'Google'
Since the easiest way to join two datasets is if they both have the same column names, we need to rename the columns of Apple so that they're the same as the ones of Google, or vice versa. In this case, we're going to change the Apple column names to the names of the Google columns.

This is an important step to unify the two datasets!
# Create a variable called old_names where you'll store the column names of the Apple dataframe.
# Use the feature .columns.
old_names = Apple.columns
# Create a variable called new_names where you'll store the column names of the Google dataframe.
new_names = Google.columns
# Use the rename() DataFrame method to change the column names.
Apple = Apple.rename(columns = dict(zip(old_names,new_names)))
# Check the new column names
Apple
Let's combine the two datasets into a single data frame. For simplicity, we'll overwrite our Google variable with the combined data.
# Let's use pd.concat() to append Apple to Google.
# (DataFrame.append is deprecated in recent versions of pandas, so we use pd.concat instead.)
Google = pd.concat([Google, Apple])
# Using the sample() method with the number 12 passed to it, check 12 random points of your dataset.
Google.sample(12)
As you can see, there are some `NaN` values. We want to eliminate all these `NaN` values from the table.
# Let's first check the dimensions of our dataframe before dropping `NaN` values. Use the .shape feature.
print(Google.shape)
# Use the dropna() method to eliminate all the NaN values, and overwrite the same dataframe with the result.
Google.dropna(inplace=True)
# Check the new dimensions of our dataframe.
print(Google.shape)
Apps that haven't been reviewed yet can't help us solve our brief.
So let's check to see if any apps have no reviews at all.
# Subset your df to pick out just those rows whose value for 'Reviews' is equal to 0.
# Do a count() on the result.
df = Google[Google.Reviews == 0]
df.count()
929 apps have no reviews; we need to eliminate these data points!
# Eliminate the points that have 0 reviews.
Google = Google[Google.Reviews != 0]
Summarize the data (grouped by `platform`)

What we need to solve our brief is a summary of the `Rating` column, but separated by the different platforms.
# To summarize analytically, let's use the groupby() method on our df.
Google.groupby('platform').mean(numeric_only=True)
Interesting! Our means of 4.049697 and 4.191757 don't seem all that different! Perhaps we've solved our brief already: there's no significant difference between Google Play app reviews and Apple Store app reviews. We have an observed difference here: it is simply (4.191757 - 4.049697) = 0.14206. This is just the actual difference that we observed between the mean rating for apps from Google Play and the mean rating for apps from the Apple Store. Let's look at how we're going to use this observed difference to solve our problem using a statistical test.
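We can also compute this observed difference directly from the grouped means. Here's a minimal sketch, assuming the 'Apple' and 'Google' platform labels we created above (the variable names are just for illustration):
# Compute the observed difference in mean ratings from the grouped means
mean_ratings = Google.groupby('platform')['Rating'].mean()
observed_difference = mean_ratings.loc['Google'] - mean_ratings.loc['Apple']
print(observed_difference)  # should be roughly 0.14206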
Outline of our method:
1. State a null hypothesis and an alternative hypothesis, and pick a significance level.
2. Check the distribution of the data so we can choose an appropriate statistical test.
3. Build the distribution of the difference in mean ratings under the null hypothesis by repeatedly permuting the data.
4. Compare our observed difference against that null distribution to get a p-value, and accept or reject the null hypothesis accordingly.
If you want to look more deeply at the statistics behind this project, check out this resource.
Let's also get a visual summary of the `Rating` column, separated by the different platforms.
A good tool to use here is the boxplot!
# Call the boxplot() method on our df.
Google.boxplot(by='platform',column ='Rating')
plt.tight_layout();
Here we see the same information as in the analytical summary, but with a boxplot. Can you see how the boxplot is working here? If you need to revise your boxplots, check out this link.
Our Null hypothesis is just:
Hnull: the observed difference in the mean rating of Apple Store and Google Play apps is due to chance (and thus not due to the platform).
The more interesting hypothesis is called the Alternate hypothesis:
Halternative: the observed difference in the average ratings of Apple and Google users is not due to chance (and is actually due to the platform).
We're also going to pick a significance level of 0.05.
Now that the hypotheses and significance level are defined, we can select a statistical test to determine which hypothesis to accept.
There are many different statistical tests, all with different assumptions. You'll develop excellent judgement about when to use which statistical test over the course of the Data Science Career Track. But in general, one of the most important things to determine is the distribution of the data.
# Create a subset of the column 'Rating' by the different platforms.
# Call the subsets 'apple' and 'google'
apple = Google[Google.platform == 'Apple']['Rating']
google = Google[Google.platform == 'Google']['Rating']
# Using the stats.normaltest() method, get an indication of whether the apple data are normally distributed
# Save the result in a variable called apple_normal, and print it out
apple_normal = stats.normaltest(apple)
apple_normal
# Do the same with the google data.
google_normal = stats.normaltest(google)
google_normal
Since the null hypothesis of normaltest() is that the data are normally distributed, the lower the p-value in the result of this test, the stronger the evidence that the data are non-normal.

Since the p-value is (effectively) 0 for both tests, regardless of what we pick for the significance level, our conclusion is that the data are not normally distributed.
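As a quick check, we can compare these p-values against our 0.05 significance level; the result objects returned by stats.normaltest() expose a pvalue attribute:
# Compare the normality-test p-values against our significance level
alpha = 0.05
print(apple_normal.pvalue < alpha)   # True -> reject normality for the apple data
print(google_normal.pvalue < alpha)  # True -> reject normality for the google data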
We can also check out the distribution of the data visually with a histogram. A normal distribution has the following visual characteristics:
- symmetric
- unimodal (one hump)
- roughly identical mean, median and mode
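Before plotting, here's a quick numeric spot-check of that last point, using the apple and google subsets we created above:
# Spot-check how close the mean and median are for each platform's ratings
print(apple.mean(), apple.median())
print(google.mean(), google.median())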
# Create a histogram of the apple reviews distribution
plt.hist(apple);
# Create a histogram of the google data
plt.hist(google);
Since the data aren't normally distributed, we're using a non-parametric test here. This is simply a label for statistical tests used when we can't assume a particular distribution for the data. These tests are extraordinarily flexible because of how few assumptions we need to make.
Check out more about permutations here.
# Create a column called `Permutation1`, and assign to it the result of permuting (shuffling) the Rating column
# This assignment will use our numpy object's random.permutation() method
Google['Permutation1'] = np.random.permutation(Google.Rating)
# Call the describe() method on our permutation grouped by 'platform'.
Google.groupby('platform')[['Permutation1']].describe()
# Let's compare with the previous analytical summary:
Google.groupby('platform')[['Rating','Permutation1']].describe().T
# Check the mean of the permuted ratings by platform
Google.groupby('platform')[['Permutation1']].mean()
# Set a random seed so the permutations below are reproducible
np.random.seed(42)
# The difference in the means for Permutation1 (0.001103) now looks hugely different from our observed difference of 0.14206.
# It's sure starting to look like our observed difference is significant, and that the Null is false; platform does impact ratings.
# But to be sure, let's create 10,000 permutations, calculate the mean ratings for Google and Apple apps and the difference between these for each one, and then take the average of all of these differences.
# Let's create a vector with the differences - that will be the distribution of the Null.
# First, make an array called difference.
difference = np.empty(10000)
# Now make a for loop that does the following 10,000 times:
# 1. makes a permutation of the 'Rating' as you did above
# 2. calculates the difference in the mean rating for apple and the mean rating for google.
for i in range(10000):
    # Shuffle the ratings, then take the difference between the permuted platform means
    Google['Permutation1'] = np.random.permutation(Google.Rating)
    permuted_means = Google.groupby('platform')['Permutation1'].mean()
    difference[i] = permuted_means.loc['Apple'] - permuted_means.loc['Google']
# Make a variable called 'histo', and assign to it the result of plotting a histogram of the difference list.
histo = plt.hist(difference)
# Now make a variable called obs_difference, and assign it the difference between the mean of our 'apple' variable and the mean of our 'google' variable.
obs_difference = np.mean(apple) - np.mean(google)
# Make this difference absolute with the built-in abs() function.
obs_difference = abs(obs_difference)
# Print out this value; it should be 0.1420605474512291.
print(obs_difference)
# Another way, per the DataCamp course (section 11.3):
np.random.seed(42)
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)
    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates
# Compute the observed difference of mean ratings: empirical_diff_means
empirical_diff_means = diff_of_means(apple, google)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(apple, google, diff_of_means, size=10000)
# Compute the (two-sided) p-value, comparing absolute differences so the direction of the difference doesn't matter: p
p = np.sum(np.abs(perm_replicates) >= abs(empirical_diff_means)) / len(perm_replicates)
# Print the result
print('p-value =', p)
histo = plt.hist(perm_replicates)
What do we know?

Recall: the p-value of our observed data is just the proportion of the data given the null that's at least as extreme as that observed data.

As a result, we're going to count how many of the differences in our difference array are at least as extreme as our observed difference. If 5% or fewer of them are, then we will reject the Null.
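Here's a minimal sketch of that count, using the difference array and obs_difference from above (two-sided, since we took the absolute observed difference):
# Count how many permuted differences are at least as extreme as the observed one
extreme_count = np.sum(np.abs(difference) >= obs_difference)
print('extreme differences:', extreme_count)
print('p-value =', extreme_count / len(difference))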
So actually, zero differences are at least as extreme as our observed difference!
So the p-value of our observed data is 0.
It doesn't matter which significance level we pick; our observed data is statistically significant, and we reject the Null.
We conclude that platform does impact ratings. Specifically, we should advise our client to integrate only Google Play into their operating system interface.
The test we used here is the Permutation test. This was appropriate because our data were not normally distributed!
As we've seen in Professor Spiegelhalter's book, there are actually many different statistical tests, all with different assumptions. How many of these different statistical tests can you remember? How much do you remember about what the appropriate conditions are under which to use them?
Make a note of your answers to these questions, and discuss them with your mentor at your next call.